5. Amendments to the claims of the International application under PCT Article 34:
9. [x] A translation of the annexes to the international preliminary examination
11. An International Search Report (PCT/ISA/210)
b. [ ] has been transmitted by the International Bureau.
States International Searching Authority.
listed.
Basic National Fee                              Fee
IPEA - US                                       $670.00
ISA - US                                        $760.00
PTO not ISA or IPEA                             $970.00
Claims meet PCT Art. 33(1)-(4) - IPEA - US      $96.00
Filing with EPO or JPO search report            $840.00     $840.00
Enter appropriate basic fee ->                              $840.00

Claims*               Number filed    Number extra    Rate
Total claims          9 - 20          0               $18.00     $
Independent claims    1 - 3           0               $78.00     $
Multiple dependent claims (if applicable)             $260.00

Total of above                                                   $840.00
Small entity statement enclosed, 1 if Yes, 0 if No -> 0          $0.00
Total national fee                                               $840.00
Fee for recording enclosed assignment                 $40.00
Total fees enclosed                                              $840.00

*After any attached preliminary amendment reducing the number of claims and/or 
deleting multiple dependencies. 



[x] A check in the amount of $ 840.00 to cover the above fees is 
enclosed. 



[ ] Please charge our Deposit Account No. 18-0988 in the amount of 
$ . A duplicate copy of this sheet is enclosed. 
16. The Commissioner is hereby authorized to charge the following additional fees
that may be required by this paper and during the entire pendency of this
application to our Deposit Account No. 18-0988:

a. [X] 37 CFR 1.492(a)(1), (2), (3), (4) and (5) (filing fees)

WARNING: BECAUSE FAILURE TO PAY THE NATIONAL FEE WITHIN 30 MONTHS WITHOUT EXTENSION
(37 CFR § 1.495(b)(2)) RESULTS IN ABANDONMENT OF THE APPLICATION, IT WOULD BE BEST TO
ALWAYS CHECK THE ABOVE BOX.

b. [ ] 37 CFR 1.492(b), (c) and (d) (presentation of extra claims)

NOTE: Because additional fees for excess or multiple dependent claims not paid on filing or on later
presentation must only be paid or these claims cancelled by amendment prior to the expiration of the time
period set for response by the PTO in any notice of fee deficiency (37 CFR 1.492(d)), it might be best not
to authorize the PTO to charge additional claim fees, except possibly when dealing with amendments after
final action.



Respectfully submitted, 
1621 Euclid Avenue, 19th Floor
Cleveland, Ohio 44115 

Tel: 216-621-1113 Fax: 216-621-6165 
DESCRIPTION

Scoring of Text Units

TECHNICAL FIELD

The invention relates to a method and a system for scoring
text units (e.g. sentences), for example according to
their contribution in defining the meaning of a source
text (textual relevance), their ability to form a
cohesive subtext (textual connectivity) or the extent and
effectiveness to which they address the different topics
which characterise the subject matter of the text (topic
aptness).

BACKGROUND ART

When abridging a text it is desirable to select a portion
of the text which is most representative in that it
contains as many of the key concepts defining the text
as possible (textual relevance). As an example,
EP-A-741364 (Xerox Corp.) discloses a method of selecting
key phrases from a machine readable document by (a)
generating from the document a multiplicity of candidate
phrases (units of more than one word), followed by (b)
selecting as key phrases a subset of the candidate phrases.
This selection, known as summarisation, may also take into
consideration the degree of textual connectivity among
sentences so as to minimise the danger of producing
summaries which contain poorly linked sentences.
Computing lexical cohesion for all pair-wise text unit
combinations in a text provides an effective way of
assessing textual relevance and connectivity in parallel,
see for example Hoey M. (1991) Patterns of Lexis in Text.
OUP, Oxford, UK; and Collier A. (1994) A System for
Automatic Concordance Line Selection. NEMLAP 1994,
Manchester, UK. A simple way of computing lexical
cohesion for a pair of text units is to count non-stop
words which occur in both text units. Non-stop words can
be intuitively thought of as words which have high
informational content. They usually exclude words with
a very high frequency of occurrence, e.g. closed class
words such as determiners, prepositions and conjunctions,
see for example Fox, C. (1992) Lexical Analysis and
Stoplists, in Frakes W and Baeza-Yates R (eds)
Information Retrieval: Data Structures & Algorithms.
Prentice Hall, Upper Saddle River, NJ, USA, pp 102-130.

A sample list of stop words is given below:- 

a about above across after again against all almost alone 
along already also although always among and another any 
anybody anyone anything anywhere are area areas around 
as ask asked asking asks at away b back backed backing 
backs be became because become becomes been before begun
behind being beings best better between big both but by
c came can cannot case cases certain certainly clear
clearly come could d did differ different differently

do does done down downed downing ... v very w

want wanted wanting wants was way ways we well 
wells went were what when where whether which while 
who whole whose why will with within without work 
worked working works would x y year years yet you 
young younger youngest your yours z 

Text units which contain a greater number of shared
non-stop words are more likely to provide a better
abridgement of the original text for two reasons:

the more often a word with high informational content
occurs in a text, the more topical and germane
to the text the word is likely to be, and

the more often two text units share a word,
the more connected they are likely to be.

As an illustrative example, consider the ranking of the
following sample text, where digits surrounded by hash
characters (#) are text unit indexes.

#1# Report: Apple Looking for a Partner

#2# NEW YORK (Reuter) - Apple is actively looking for
a friendly merger partner, according to several
executives close to the company, the New York
Times said on Thursday.

#3# One executive who does business with Apple said
Apple employees told him the company was again
in talks with Sun Microsystems, the paper said.

#4# On Wednesday, Saudi Arabia's Prince Alwaleed Bin
Talal Bin Abdulaziz Al Saud said he owned more
than five percent of the computer maker's stock,
recently buying shares on the open market for a
total of $115 million.

#5# Oracle Corp Chairman Larry Ellison confirmed on
March 27 he had formed an independent investor
group to gauge interest in taking over Apple.

#6# The company was not immediately available to
comment.

To compute lexical cohesion according to the method
suggested by Hoey (see above reference), all unique
pairwise combinations of text units are scored according
to how many words they share, as shown in the table
below.



Text unit pairs    Words shared                        Score
#1#  #2#           Apple, look, partner                3
#1#  #3#           Apple, Apple                        2
#1#  #4#                                               0
#1#  #5#           Apple                               1
#1#  #6#                                               0
#2#  #3#           Apple, Apple, executive, company    4
#2#  #4#                                               0
#2#  #5#           Apple                               1
#2#  #6#           company                             1
#3#  #4#                                               0
#3#  #5#           Apple, Apple                        2
#3#  #6#           company                             1
#4#  #5#                                               0
#4#  #6#                                               0
#5#  #6#                                               0





The number of shared words (including multiple occurrences
of the same word) in each text unit pair provides the
individual score for that pair. For example, the individual
scores for all pairs involving text unit #2# are:





        #1#    #2#    #3#    #4#    #5#    #6#
#2#     3             4      0      1      1

Table 1
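
The pairwise counting described above can be sketched in a few lines of Python. This is only an illustration, not the method of the cited references: it works on the stemmed form of the sample text (so "executive" appears as "execut"), and the rule for repeated words (taking the larger of the two occurrence counts for each shared word) is an assumption chosen because it reproduces the scores in the table. The names UNITS, pair_score and pairwise_scores are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Sample text after stop-word removal and stemming, keyed by text unit index.
UNITS = {
    1: "report Apple look partner",
    2: "New-York Reuter Apple activ look friend merger partner accord "
       "execut close company New-York-Times Thursday",
    3: "execut busy Apple Apple employ tell company talk "
       "Sun-Microsystems paper say",
    4: "Wednesday Saudi-Arabia Prince Alwaleed-Bin-Talal-Bin-Abdulaziz-Al-Saud "
       "own percent computer maker stock recent buy share market total 115 million",
    5: "Oracle-Corp Chairman Larry-Ellison confirm March 27 form independent "
       "investor gauge-interest take-over Apple",
    6: "company immediat avail comment",
}


def pair_score(a: str, b: str) -> int:
    """Count shared words, repeats included (assumed rule: larger count per word)."""
    count_a, count_b = Counter(a.split()), Counter(b.split())
    return sum(max(count_a[w], count_b[w]) for w in count_a.keys() & count_b.keys())


def pairwise_scores(units: dict) -> dict:
    """Score every unique pair of text units; this is the costly all-pairs step."""
    return {(i, j): pair_score(units[i], units[j])
            for i, j in combinations(sorted(units), 2)}


scores = pairwise_scores(UNITS)
row_2 = [scores[tuple(sorted((2, other)))] for other in (1, 3, 4, 5, 6)]
print(row_2)  # [3, 4, 0, 1, 1]
```

Every pair is compared in turn, so the work grows with the square of the number of text units.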
The final score for a given text unit is obtained by summing
the individual scores for that text unit. According to
Hoey (see above reference), the number of links (e.g.
shared words) across two text units must be above a certain
threshold for the two text units to achieve a lexical
cohesion rank. For example, if only individual scores
greater than 2 are taken into account, the final score
for text unit #2# is (3+4=) 7. Proceeding in the same way,
the final scores for text units #1# and #3# are 3 and 4
respectively.

Such a scoring provides the following ranking:

first: text unit #2# (final score: 7);
second: text unit #3# (final score: 4); and
third: text unit #1# (final score: 3).

A text abridgement can be obtained by selecting text units
in ranking order according to the text percentage
specified by the user. For example, a 35% abridgement of
the text (i.e. an abridgement of up to 35% of the total
number of text units in the sample text) would result in
the selection of text units #2# and #3#.
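
The thresholding, ranking and selection steps can be sketched as follows. The sketch takes the non-zero individual scores for the sample text as given data. The function names, the tie-break by text unit index and the rounding down of the percentage quota are assumptions made for illustration.

```python
import math

# Non-zero individual scores for the sample text, keyed by text unit pair.
PAIR_SCORES = {(1, 2): 3, (1, 3): 2, (1, 5): 1, (2, 3): 4,
               (2, 5): 1, (2, 6): 1, (3, 5): 2, (3, 6): 1}


def final_scores(pair_scores: dict, n_units: int, min_links: int) -> dict:
    """Sum, per text unit, the individual scores strictly greater than min_links."""
    totals = {unit: 0 for unit in range(1, n_units + 1)}
    for (a, b), score in pair_scores.items():
        if score > min_links:
            totals[a] += score
            totals[b] += score
    return totals


def abridge(totals: dict, fraction: float) -> list:
    """Select scored units in ranking order, up to `fraction` of all units."""
    quota = math.floor(len(totals) * fraction)  # assumed: round the quota down
    ranked = sorted((u for u in totals if totals[u] > 0),
                    key=lambda u: (-totals[u], u))  # ties broken by index (assumed)
    return ranked[:quota]


totals = final_scores(PAIR_SCORES, n_units=6, min_links=2)
print(totals)                 # {1: 3, 2: 7, 3: 4, 4: 0, 5: 0, 6: 0}
print(abridge(totals, 0.35))  # [2, 3]
```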
Further details about lexical cohesion and the ways in
which it can be used to aid summarisation can be found
in the Hoey and Collier references mentioned above.

Other prior art on related technology includes Doi (1991)
Method and apparatus for producing an abstract of a
document - US patent 5077668; Ukita et al. (1993) Digital
Computing Apparatus for Preparing Document Text - US
patent 5257186; Withgott et al. Method and apparatus for
Summarising documents according to theme - US patent
5364703; and Pedersen, J. & J. Tukey (1997) Method and
Apparatus for Automatic Document Summarisation - US
patent 5638543.

DISCLOSURE OF INVENTION 

It is an object of the invention to provide a method and
system for ranking text units which overcomes at least
some of the disadvantages of the prior art.

According to the invention there is provided a method of 
operating on a text including a plurality of text units, 
each including one or more strings, the method including 
the steps of: 
forming a structure for each of at least some of said
strings, in which structure the string is associated with
each text unit in which the string occurs;

for each text unit summing the number of occurrences of
each other text unit in the same structure or structures
so as to form an individual score for each pair of text
units; and

processing said individual scores for each text unit in
order to form a final score for each text unit.

The use of such structures considerably reduces the time 
taken to operate on the text because it is no longer 
necessary to count the number of strings shared between 
all possible pairs of text units in turn. 

More specifically, the degree of connectivity of a text
unit with all other text units in a text can be simply
assessed by quantifying the elements (e.g. words) which
each text unit shares with pairs built by associating each
element in the text with the list of pointers to the text
units in which the element occurs. This provides a
significant advantage in terms of processing speed when
compared to a method such as the one described by Hoey
(1991) and Collier (1994) where the same assessment is
carried out by computing all pairwise combinations of text
units. In particular, the word-per-second processing
rate is significantly less affected by text size.

The method may include the further step of ranking the
text units on the basis of said individual scores.

In one embodiment of the invention, said text units are
sentences, said strings are words forming said sentences,
and the method includes the additional steps of removing
stop-words, stemming each remaining word and indexing the
sentences prior to carrying out said summing step, and
said structures are stem-index records each including a
stemmed word and one or more indexes corresponding to
sentences in which said stemmed word occurs.

In an alternative embodiment, said text is associated with
a word text including words, each word being associated
with one or more subject codes representing subjects with
which said word is associated, and said strings are subject
codes associated with said words.
In this case the method may comprise the further step of
keeping a record of the word spelling associated with each
occurrence of a subject code in a text unit, and during
said summing step disregarding occurrences of the same
subject code in a pair of text units if the same word
spelling is associated with said same subject code in said
pair of text units.

It will be appreciated that each word may have a number
of possible subject codes, some of which are contextually
inappropriate for the context in which the word is being
used. The last-mentioned feature allows the method to
perform disambiguation of the subject codes, by
disregarding occurrences of subject codes which are
contextually inappropriate, as will be described in
greater detail below.

Said step of disregarding occurrences of subject codes 
may not be carried out for subject codes which relate to 
only a single word spelling in the word text. 

Said processing step may include calculating a level for 
each text unit, in addition to said final score, and said 
level may indicate the value of the highest of said 
individual scores in relation to a threshold value.

This allows text units to be ranked first according to
level, and second according to said final score, if
desired.

The invention also provides a storage medium containing
a program for controlling a programmable data processor
to perform the method described above.

The invention also provides a system for ranking text
units in a text, the system including a data processor
programmed to perform the steps of the method described
above.

BRIEF DESCRIPTION OF DRAWINGS 

Preferred embodiments of the invention will now be
described, by way of example only, with reference to the
accompanying drawings, in which:

Figure 1 shows a flow chart outlining some of the
steps involved in a preferred embodiment of the
invention;

Figure 2 shows a flow chart which is a continuation of
the flow chart of Figure 1; and

Figure 3 shows an apparatus suitable for carrying out the
methods described below.

BEST MODE FOR CARRYING OUT THE INVENTION

In an embodiment of the invention described below the
ranking of text units is carried out with reference to
the presence of shared words across text units. The
assessment of textual relevance and connectivity can both
be carried out by counting shared links (e.g. identical
words) across all text unit pairs. The method makes it
possible to perform this assessment by quantifying the
elements (e.g. words) which each text unit shares with
stem-index pairs, each such pair comprising an element
in the text and a list of pointers to the text units in
which the element occurs. This technique makes it
possible to rank text units at a processing rate which
is significantly less affected by text size than a system
where the same assessment is carried out by computing all
pairwise combinations of text units.

The ranking is done by assessing

how germane each text unit is to the source text (textual
relevance);

how well connected each text unit is to other text units
in the source text (textual connectivity); and

how well each text unit represents the various topics
dealt with in the source text (topic aptness).

In a further embodiment described below, the same
technique is used for assessment of topic aptness. Shared
links across text units are verified in terms of
overlapping semantic codes associated with words (e.g.
the connotations business and government for the
word executive) with reference to a dictionary or
thesaurus database providing a specification of such
codes for word entries.

The method can be divided into two phases, namely a
preparatory phase, followed by a ranking phase. In the
preparatory phase the text undergoes a number of
normalisations which have the purpose of facilitating the
process of computing lexical cohesion. This phase
includes the following operations:

text segmentation;
removal of formatting commands;
recognition of proper names;
recognition of multi-word expressions;
removal of stop words; and
word tokenisation.

Further ways of normalizing the input text are also
mentioned later in the specification.

The object of segmentation is to partition the input text
into text units which stand on their own (e.g. sentences,
titles, and section headings) and to index such text units,
for example as shown in the sample text given above.

Next, formatting commands such as the HTML (hyper-text
mark-up language) mark-ups in the text are dealt with.

The sample text including HTML formatting commands looks
like the following:

<h2>Report: Apple Looking for a Partner</h2>
<!--TextStart-->
<P>

NEW YORK (Reuter) - Apple is actively looking for a
friendly merger partner, according to several executives
close to the company, the New York Times said on
Thursday.
<P>

One executive who does business with Apple said Apple
employees told him the company was again in talks with
Sun Microsystems, the paper said.

<p>

On Wednesday, Saudi Arabia's Prince Alwaleed Bin
Talal Bin Abdulaziz Al Saud said he owned more than five
percent of the computer maker's stock, recently buying
shares on the open market for a total of $115 million.
<P>

Oracle Corp Chairman Larry Ellison confirmed on March
27 he had formed an independent investor group to gauge
interest in taking over Apple.
<P>

The company was not immediately available to comment.
<!--TextEnd-->

In the present embodiment, the formatting commands are
simply removed, but alternative treatments are mentioned
below.
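
One possible way of removing such formatting commands is sketched below using the HTMLParser class from Python's standard library. The specification does not say how the removal is implemented, so the class TextOnly and the function strip_formatting are hypothetical names and the approach is an assumption.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keeps character data and silently discards tags and comments."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_formatting(markup: str) -> str:
    """Return the text of `markup` with mark-up removed and whitespace collapsed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


sample = """<h2>Report: Apple Looking for a Partner</h2>
<!--TextStart-->
<P>
The company was not immediately available to comment.
<!--TextEnd-->"""

print(strip_formatting(sample))
# Report: Apple Looking for a Partner The company was not immediately available to comment.
```

Because this sketch collapses all whitespace, the boundary between the heading and the paragraph is lost. It therefore assumes that text units have already been segmented and indexed before the mark-up is stripped.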
A facility for recognizing proper names and multi-word
expressions is also included. Such a facility makes it
possible to process expressions such as Apple, New York,
New York Times, gauge interest as single units which should
not be further tokenized. The recognition of such units
ensures that expressions which superficially resemble
each other, but have different meanings - e.g. Apple (the
company) and apple (the fruit), or York in New York (the
city) and New York Times (the newspaper) - do not actually
generate lexical cohesion links. For further information
relating to recognising proper nouns and multi-word
expressions reference can be made respectively to David
McDonald (1996) Internal and External Evidence in the
Identification and Semantic Categorization of Proper
Names, in B. Boguraev and J. Pustejovsky (eds) Corpus
Processing for Lexical Acquisition, MIT Press, and Justeson,
J. and Katz, S.M. (1995) Technical terminology: some
linguistic properties and an algorithm for identification
in text, in Natural Language Engineering, 1:9-27.

Next, all words in the input text which match stop words,
such as those mentioned above, are removed. This step
ensures that words which are low in informational content
are not taken into account when assessing lexical
cohesion. After stop-word removal, the calculation of
shared words across text units is further optimized by
tokenizing non-stop words. Word tokenization is achieved
by reducing words into stems or citation forms, e.g.

Input strings       stems         citation forms
actively looking    activ look    active look

Citation forms generally correspond to the manner in which
words are listed in conventional dictionaries, and the
process of reducing words to citation form is referred
to as lemmatisation. Reduction of words to stem form
generally involves a greater truncation of the word in
which all inflections are removed. The purpose of
reducing words to stems or citation forms is to achieve
a more effective notion of word sharing, e.g. one which
abstracts away from the effects of inflectional and/or
derivational morphology. Stemming provides a very
powerful word tokenization technique as it undoes both
derivational and inflectional morphology. For example,
stemming makes it possible to capture the similarity
between the words nature, natural, naturally, naturalize,
naturalizing as they all reduce to the stem natur. Word
reduction to citation form would only capture the
relationship between naturalize and naturalizing. In the
present embodiment, stemming will be used. For a
description of some stemming techniques reference can be
made to Frakes W. (1992) Stemming Algorithms, in Frakes
W and Baeza-Yates R (eds) Information Retrieval: Data
Structures & Algorithms. Prentice Hall, Upper Saddle
River, NJ, USA, pp. 131-160. For further information
relating to lemmatisation reference can be made to Hadumod
Bussmann (1996) Routledge Dictionary of Language and
Linguistics, Routledge, London, p. 272. Following the
stages of stop-word removal and stemming, the sample text
is as shown below.



#1# report Apple look partner
#2# New-York Reuter Apple activ look friend merger
    partner accord
    execut close company New-York-Times Thursday
#3# execut busy Apple Apple employ tell company talk
    Sun-Microsystems
    paper say
#4# Wednesday Saudi-Arabia Prince
    Alwaleed-Bin-Talal-Bin-Abdulaziz-Al-Saud
    own percent computer maker stock
    recent buy share market total 115 million
#5# Oracle-Corp Chairman Larry-Ellison confirm
    March 27 form independent
    investor gauge-interest take-over Apple
#6# company immediat avail comment

Following the preparatory phase described above, the
textual relevance and connectivity of each text unit is
assessed by measuring the number of stems which the text
unit shares with each of the other text units in the sample
text. The ranking process comprises two main stages: the
indexing of tokenized words, and the scoring of tokenized
words in text units.

In the first stage, all stems in the normalized text,
which has undergone the preparatory phase described
above, are indexed with reference to the text units in
which they occur. For example, Apple occurs five times
in four of the text units in the normalised text: once
in each of #1#, #2# and #5#, and twice in #3#. Consequently,
a record is made where Apple is associated with these text
unit indexes:

<Apple {#1#, #2#, #3#, #3#, #5#}>

A similar record is made for each other stem in the
normalised text, each record being referred to as a
stem-index record.
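
As a rough sketch of this indexing stage, the stem-index records can be held in a dictionary that maps each stem to the list of text unit indexes in which it occurs, with one entry per occurrence. The dictionary representation and the names UNITS and build_stem_index are assumptions made for illustration.

```python
from collections import defaultdict

# Normalised sample text (stop words removed, words stemmed), keyed by index.
UNITS = {
    1: "report Apple look partner",
    2: "New-York Reuter Apple activ look friend merger partner accord "
       "execut close company New-York-Times Thursday",
    3: "execut busy Apple Apple employ tell company talk "
       "Sun-Microsystems paper say",
    4: "Wednesday Saudi-Arabia Prince Alwaleed-Bin-Talal-Bin-Abdulaziz-Al-Saud "
       "own percent computer maker stock recent buy share market total 115 million",
    5: "Oracle-Corp Chairman Larry-Ellison confirm March 27 form independent "
       "investor gauge-interest take-over Apple",
    6: "company immediat avail comment",
}


def build_stem_index(units: dict) -> dict:
    """Map each stem to the text unit indexes where it occurs, repeats included."""
    index = defaultdict(list)
    for unit_id in sorted(units):
        for stem in units[unit_id].split():
            index[stem].append(unit_id)
    return dict(index)


stem_index = build_stem_index(UNITS)
print(stem_index["Apple"])    # [1, 2, 3, 3, 5]
print(stem_index["company"])  # [2, 3, 6]
```

The index is built in a single pass over the text, which is what allows the later scoring stage to avoid comparing every pair of text units directly.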

A final text unit score is calculated for each text unit
using the list of stem-index records resulting from the
indexing stage described above. The objective of such a
scoring process is to register how often the tokenized
words from a text unit occur in each of the other text
units. In performing this assessment, provisions are
made for a threshold which specifies the minimal number
of links required for text units to be considered as
lexically cohesive. The recursive scoring procedure
used to generate the final score for each text unit makes
use of the following variables.

TRSH    is the lexical cohesion threshold
TU      is the current text unit
LC_TU   is the current lexical cohesion score of TU (i.e.
        LC_TU is the count of tokenised words TU shares
        with some other text unit)
CLevel  is the level of the current lexical cohesion
        score, calculated as the difference between LC_TU
        and TRSH
Score   is the lexical cohesion score previously
        assigned to TU (if any)
Level   is the level for the lexical cohesion score
        previously assigned to TU (if any)

The scoring procedure makes use of a scoring structure
<Level, TU, Score>, and is repeated for each text unit
in turn, in order to produce the final score for the text
unit TU (i.e. the final value of LC_TU in the scoring
structure). The procedure can then be repeated for other
text units TU. The recursive scoring procedure used in
this exemplary embodiment is as follows.

if LC_TU = 0, then do nothing
else, if the scoring structure <Level, TU, Score>
exists, then
    if Level > CLevel, then do nothing
    else, if Level = CLevel, then the new scoring
    structure is <Level, TU, Score + LC_TU>
    else, if CLevel > 0, then
        if Level > 0, then the new scoring structure is
        <1, TU, Score + LC_TU>
        if Level ≤ 0, then the new scoring structure
        is <1, TU, LC_TU>
    else, if CLevel ≤ 0, the new scoring structure is
    <CLevel, TU, LC_TU>
else (if the scoring structure does not exist, then)
    if CLevel > 0, then create the scoring structure
    <1, TU, LC_TU>
    else create the scoring structure <CLevel, TU, LC_TU>

WO 99/39282 PCT/JP99/00259
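
The procedure can be sketched in Python as follows. This is one reading of the listing, under two assumptions: the garbled comparisons in the inner branches are taken to be "Level ≤ 0" and "CLevel ≤ 0", and the scoring structure for one text unit is held as a (Level, Score) pair, with None standing for "no structure yet". The names update and score_unit are hypothetical.

```python
from typing import Iterable, Optional, Tuple

Structure = Optional[Tuple[int, int]]  # (Level, Score) for one text unit, or None


def update(structure: Structure, lc: int, threshold: int) -> Structure:
    """Fold one individual score `lc` (LC_TU) into the scoring structure."""
    if lc == 0:
        return structure                      # no shared stems: do nothing
    clevel = lc - threshold                   # CLevel = LC_TU - TRSH
    if structure is None:                     # structure does not exist yet
        return (1, lc) if clevel > 0 else (clevel, lc)
    level, score = structure
    if level > clevel:
        return structure                      # a higher level is already stored
    if level == clevel:
        return (level, score + lc)            # same level: accumulate
    if clevel > 0:                            # here level < clevel
        return (1, score + lc) if level > 0 else (1, lc)
    return (clevel, lc)                       # higher level, still not above threshold


def score_unit(individual_scores: Iterable[int], threshold: int) -> Structure:
    """Apply `update` to every individual score of one text unit."""
    structure: Structure = None
    for lc in individual_scores:
        structure = update(structure, lc, threshold)
    return structure


print(score_unit([3, 4, 0, 1, 1], threshold=2))  # (1, 7)
print(score_unit([0, 1, 1, 0, 0], threshold=2))  # (-1, 2)
```

Under this reading the result does not depend on the order in which the individual scores are visited, because a lower-level score is either discarded or replaced as soon as a higher-level one is seen.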

The above procedure can be more readily understood by
referring to Figure 1, which shows the procedure in the
form of a flow chart. In the flow chart decisions are
indicated by diamond-shaped boxes. If the answer to the
question within the box is yes, the procedure follows
the arrow labelled Y at the bottom of the box, otherwise
the procedure follows the arrow labelled N at one of the
sides of the box.

The start of the procedure is indicated by step 10. In
step 12 the index of the first text unit of the
normalised text is taken and represented by #TU#. In
step 14 the index of the last text unit is taken and
represented by #B#. In the sample text given above, the
last text unit is text unit #6#. The procedure then
flows to step 16 where the lexical cohesion score of #TU#
and #B# is calculated and assigned to LC_TU. This lexical
cohesion score is the individual score referred to above
and shown in Table 1. However, the manner in which it is
calculated differs from that described above, and will
now be described.

Suppose, for example, we are scoring text unit #2# (i.e.
#TU# = #2#) with a lexical cohesion threshold of 2. First,
all stem-index records whose stem is present in text unit
#2# are selected, as shown below.

<Apple {#1#, #2#, #3#, #3#, #5#}>
<company {#2#, #3#, #6#}>
<execut {#2#, #3#}>
<look {#1#, #2#}>
<partner {#1#, #2#}>

Stems which are associated with only one text unit index
are eliminated from this list as they simply occur in a
text unit, but do not connect a pair of text units.

Then a tuplet is formed consisting of the index for the
text unit to be scored for lexical cohesion (i.e. #2#),
and all the stem-index records whose stem occurs in that
text unit, as shown below.

<#2# <Apple {#1#, #2#, #3#, #3#, #5#}>
     <company {#2#, #3#, #6#}>
     <execut {#2#, #3#}>
     <look {#1#, #2#}>
     <partner {#1#, #2#}>>

Next, identical index occurrences in the tuplet are summed
together, to give the following results.





        #1#    #2#    #3#    #4#    #5#    #6#
#2#     3             4      0      1      1

Table 2



Index occurrences referring to the text unit being
assessed (i.e. #2#) are not counted as they do not register
lexical cohesion (thus the second entry in the table is
blank).

The same procedure of forming a tuplet and summing
identical index occurrences is then carried out for each
other text unit. For example, the tuplet for text unit
#6# is:

<#6# <company {#2#, #3#, #6#}>>

This is simpler than the tuplet for text unit #2# because
company is the only stem which text unit #6# shares with
any other text unit. This tuplet gives the following
results.





        #1#    #2#    #3#    #4#    #5#    #6#
#6#     0      1      1      0      0

Table 3



This method is considerably faster than that of the prior
art because it does not involve a comparison of every pair
of text units for each word in the sample text.
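
The tuplet step can be sketched as a tally over the stem-index records. The records listed here are the five stems of the sample text that occur in more than one text unit. The dictionary layout and the names STEM_INDEX, individual_scores and as_row are assumptions for illustration only.

```python
from collections import Counter

# Stem-index records for stems that link at least two text units in the sample.
STEM_INDEX = {
    "Apple": [1, 2, 3, 3, 5],
    "company": [2, 3, 6],
    "execut": [2, 3],
    "look": [1, 2],
    "partner": [1, 2],
}


def individual_scores(unit_id: int, stem_index: dict) -> Counter:
    """Form the tuplet for `unit_id` and sum identical index occurrences."""
    tally = Counter()
    for indexes in stem_index.values():
        if unit_id in indexes:                      # record belongs to the tuplet
            tally.update(i for i in indexes if i != unit_id)  # skip the unit itself
    return tally


def as_row(unit_id: int, stem_index: dict, n_units: int) -> list:
    """Individual scores against every other text unit, in index order."""
    tally = individual_scores(unit_id, stem_index)
    return [tally[i] for i in range(1, n_units + 1) if i != unit_id]


print(as_row(2, STEM_INDEX, 6))  # [3, 4, 0, 1, 1]
print(as_row(6, STEM_INDEX, 6))  # [0, 1, 1, 0, 0]
```

Each text unit only touches the records of the stems it actually contains, so no explicit pair-by-pair comparison of text units is needed.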

The final cohesion score of text units #2# and #6# is
calculated by applying the scoring procedure of Figure
1 to each row in Table 2 and Table 3 respectively. Scoring
a text unit according to this procedure involves adding
the individual scores which are either above a threshold
(for Level 1), or below the threshold and of the same
magnitude (for lower levels). (The use of levels in the
procedure is discussed below.)

Having discussed the way in which individual lexical
cohesion scores (for each text unit pair) are calculated
in step 16 using tuplets, we shall return to Figure 1 to
follow the procedure for calculation of the final lexical
cohesion score for each text unit. However, before
returning to Figure 1 it is noted that the simplest way
of forming the final score would be to sum the individual
scores for each text unit (i.e. for #2# and #6#, sum each
row in Tables 2 and 3 above), whilst ignoring all
individual scores below a certain threshold value.
However, the procedure of Figure 1 goes further in that
it determines not only a final score for each text unit,
but also a level for each text unit, as discussed below.

The highest level is 1, which indicates that the greatest
individual score (for a given text unit) is above the
threshold. The final score for that text unit is then
simply the sum of all individual scores (for that text
unit) which are above the threshold.

The meanings of level 1 and the next three levels below
level 1, and the ways in which the final score for these
levels is calculated, are shown in the table below.
Level   Meaning of Level                             Final Score
1       Greatest individual score > threshold        Sum of all individual scores above threshold
0       Greatest individual score = threshold        Sum of all individual scores equal to threshold
-1      Greatest individual score = threshold - 1    Sum of all individual scores equal to threshold - 1
-2      Greatest individual score = threshold - 2    Sum of all individual scores equal to threshold - 2



It will be seen that if threshold = 0, only level 1
exists, and the final score for a given text unit is
simply the sum of all individual scores for that text
unit. In fact the total number of levels is equal to the
threshold + 1.

Some examples of individual scores, and the levels and
final scores they produce (by following the procedure of
Figure 1) for a threshold of 2 are given below.



Individual scores 


Level 


Final Score 


20201 


0 


4 


liooo 


-1 


2 


5 6200 


1 


11 


1 11 1 1 


-I 


5 
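
By way of illustration only, the mapping from individual scores to
a level and a final score can be sketched in Python as follows. The
function name level_and_final_score and its arguments are assumptions
made for this sketch and do not appear in Figure 1. The sketch
computes the result in one pass over a complete row of scores,
whereas the procedure of Figure 1 builds it up step by step.

```python
from typing import Optional, Sequence, Tuple


def level_and_final_score(
    scores: Sequence[int], threshold: int
) -> Tuple[Optional[int], int]:
    """Return (level, final_score) for one text unit.

    `scores` holds the individual scores of the text unit against
    every other text unit. The level is None when all scores are zero.
    """
    top = max(scores, default=0)
    if top == 0:
        # The text unit shares nothing with any other text unit.
        return None, 0
    if top > threshold:
        # Level 1: sum every individual score above the threshold.
        return 1, sum(s for s in scores if s > threshold)
    # Level 0, -1, -2, ...: sum the scores equal to the greatest score.
    return top - threshold, sum(s for s in scores if s == top)


# The four example rows above, with a threshold of 2.
examples = [[2, 0, 2, 0, 1], [1, 1, 0, 0, 0], [5, 6, 2, 0, 0], [1, 1, 1, 1, 1]]
results = [level_and_final_score(row, threshold=2) for row in examples]
```

With these inputs the sketch gives the levels and final scores listed
in the table above.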



The purpose of calculating a level for each text unit is
to allow the text units to be ranked first according to
level (highest level first) and second according to final
score (highest final score first). In this way, text
units having no individual scores above the threshold are
not necessarily ignored in the subsequent summarisation
process.

Returning to Figure 1, in step 18 the procedure branches
into two depending on whether LC_TU = 0, where LC_TU is the
lexical cohesion score of the text unit currently being
considered. A lexical cohesion score of zero between two
text units (i.e. LC_TU = 0) indicates that the two text units
do not share any stems. If LC_TU = 0 then the procedure
goes to step 20. As discussed below, the text unit index
#B# is decremented by 1 at step 28 during each cycle of
the procedure. At step 20, if #B# has reached 1 then #TU#
is incremented by 1 in step 22. That is, the next text
unit (in this case #2#) is assigned to #TU#. In step 24
the procedure is stopped (at step 26) if #TU# has reached
the maximum value + 1 (i.e. 6 + 1 = 7 for our sample text),
otherwise control passes back to step 14.

At step 20, if #B# has not yet been decreased to the
first text unit (i.e. #B# is still greater than 1), then
control passes to step 28, in which #B# is decremented by
1 (i.e. the next lower text unit is assigned to #B#).

It will therefore be seen that the effect of steps 10 to
28 is to calculate the individual lexical cohesion scores
for all pairs of text units.

Returning to step 18, if LC_TU does not equal 0, then
control passes to step 30, which determines whether or
not the scoring structure <Level, TU, Score> already
exists. The first time that step 30 is reached no
scoring structure will already exist, and control will
pass to step 32, which determines whether CLevel is
greater than 0. CLevel is the current value of Level and
is equal to (LC_TU - TRSH), where TRSH is the lexical
cohesion threshold, which is selected in advance. In
steps 34 and 36 values are assigned to the scoring
structure according to the outcome of step 32, and
control then passes back to step 20.

At step 30, if the scoring structure already exists
(which will always be the case except for the first time
step 30 is reached for each value of TU, given that the
first time step 30 is reached values are assigned to the
scoring structure at steps 34 and 36 as described above),
control passes to step 38 which determines whether Level
(i.e. the previous value of CLevel) is greater than
CLevel. If so, control passes back to step 20. Otherwise,
control passes to step 40, which determines whether
Level is equal to CLevel. If so, new values are assigned
to the scoring structure in step 42, and control passes
back to step 20. Otherwise, control passes to step 44
(see Figure 2), which determines whether CLevel is greater
than 0. If so, control passes to step 46, and new values
are assigned to the scoring structure in step 48, or step
50, depending on whether the level is greater than 0, and
control passes back to step 20. At step 44, if CLevel is
not greater than 0, control passes to step 52, which
determines whether CLevel is less than, or equal to, 0.
If step 52 is reached, the answer to this question should
always be yes, so that new values are assigned to the
scoring structure in step 54, and control is passed back
to step 20.
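
The values assigned in steps 34 to 54 are shown only in the figures,
so the following Python sketch is one plausible reading of the
branching described above and not a transcription of it. It assumes
that every score above the threshold is treated as level 1, and that
the scoring structure is kept separately for each value of TU. The
names update_structure and fold_scores are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Assumed layout of the scoring structure: (Level, TU, Score).
ScoringStructure = Tuple[int, int, int]


def update_structure(
    structure: Optional[ScoringStructure], tu: int, lc_tu: int, trsh: int
) -> Optional[ScoringStructure]:
    """Fold one individual score LC_TU into the scoring structure of TU."""
    if lc_tu == 0:
        # Step 18: no shared stems, nothing to record.
        return structure
    clevel = lc_tu - trsh
    # Assumption: all scores above the threshold share level 1.
    level_now = 1 if clevel > 0 else clevel
    if structure is None or structure[0] < level_now:
        # First score for this TU, or a higher level was reached: restart.
        return (level_now, tu, lc_tu)
    if structure[0] == level_now:
        # Same level: accumulate the score.
        return (level_now, tu, structure[2] + lc_tu)
    # The stored level is higher: the new score is ignored.
    return structure


def fold_scores(tu: int, scores: Iterable[int], trsh: int) -> Optional[ScoringStructure]:
    """Apply update_structure to each individual score of TU in turn."""
    structure: Optional[ScoringStructure] = None
    for lc_tu in scores:
        structure = update_structure(structure, tu, lc_tu, trsh)
    return structure
```

Under these assumptions, folding the example rows given earlier for a
threshold of 2 yields the same levels and final scores as the table
of examples.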

Following the procedure of Figure 1 for all text units
in the sample text, and a threshold of 2, the levels and
final scores assigned to each text unit are as follows:

Text Unit   Level   Score
#1#         1       3
#2#         1       7
#3#         1       4
#4#                 0
#5#         0       2
#6#         -1      2



These provide the following ranking of text units in terms
of lexical cohesion.

Rank   Text Unit   Level   Score
1st    #2#         1       7
2nd    #3#         1       4
3rd    #1#         1       3
4th    #5#         0       2
5th    #6#         -1      2
6th    #4#                 0



This shows the preferred order in which the text units
will be selected in a summarisation process. It is noted
that no level is assigned to text unit #4#, as this text
unit shares no stems with any other text unit.
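
The two-key ordering (level first, final score second) can be
sketched in Python as below. The function name rank_text_units and
the use of None for a text unit without a level are assumptions of
this sketch. Text units without a level are placed last.

```python
from typing import Dict, List, Optional, Tuple


def rank_text_units(results: Dict[int, Tuple[Optional[int], int]]) -> List[int]:
    """Order text unit indexes by level, then by final score, highest first."""

    def sort_key(unit: int) -> Tuple[bool, int, int]:
        level, score = results[unit]
        has_level = level is not None
        # Units with a level sort ahead of units without one.
        return (has_level, level if has_level else 0, score)

    return sorted(results, key=sort_key, reverse=True)


# Levels and scores of the sample text for a threshold of 2.
sample = {1: (1, 3), 2: (1, 7), 3: (1, 4), 4: (None, 0), 5: (0, 2), 6: (-1, 2)}
ranking = rank_text_units(sample)
```

For the sample values this gives the order #2#, #3#, #1#, #5#, #6#,
#4#, matching the ranking table above.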



When used with a dictionary database providing
information about the subject domain of words the method
described above can be slightly modified to detect the
major themes and topics of a document automatically.
As an example, the words in our sample text have the
following subject domain codes:



Word              Associated Codes
actively-adv      OR
business-n        BZ
buy-v             MAR, MERG, MI
confirm-v         CHR
company-n         F, MI, SCG, TH
employee-n        LAB
executive-n       BZ, GOV
friendly-adj      FA, G
group-n           GROU, OR, POP
independent-adj   CHT, FA
interest-n        BZ, EC, G, J, U
investor-n        IV, ON
look-v            PHYA
maker-n           JC
market-n          BZ, MAR
merger-n          MERG
open-adj          CER, PFE
own-v             MEN
paper-n           PAPP
partner-n         DA, F, MGE, TG
say-v             CN
stock-n           AH, AM, AP, BRE, FLW, FOO, GU, IV, PM
take-v            EC, PG, SH, V, WRI
talk-n            RHE



The meanings of these codes are given below:

CODE   Explanation
AH     Animal Farming & Husbandry
AM     Animal Names (not taxonomic terms (TAXI))
AP     Anthropology & Ethnology (incl racial groups)
BRE    Breeds and Breeding
BZ     Business & Commerce
CER    Ceremonies
CHR    Christianity
CHT    Character Traits (eg. meddlesome, mellow, outgoing)
CN     Communications (eg. telephony, telegraphy, audiovisual, information science, radio)
DA     Dance & Choreography
EC     Economics & Finance
F      Finance & Business
FA     Overseas Politics & International Relations
FLW    Flower Names: plants known primarily as flowers
FOO    Foods: all edible items
G      Sports (incl Games & Pastimes)
GOV    Government Admin & Organisations (eg reshuffles)
GROU   Groups of Musicians
GU     Guns
IV     Investment & Stock Markets
J      Crime and the Law
JC     Judaeo-Christian Religion
LAB    Staff and the Workforce (incl Labour relations)
MAR    Marketing & Merchandising
MEN    Mental States & Feelings (eg. depressed, tense, non-plussed)
MERG   Mergers, Monopolies, Takeovers, Joint Ventures
MGE    Marriage, Divorce, Relationships & Infidelity
MI     Military (the armed forces)
ON     Occupations & Trades
OR     Organisations, Groups & Orders
PAPF   Paper & Stationery
PFE    Banking & Personal Finance
PG     Photography
PHYA   Animal physiology
PM     Plant Names
POP    Pop & Rock
RHE    Rhetoric & Oratory (eg. ad lib, eulogy, scripted)
SCG    Scouting & Girl Guides
SH     Clothing
TG     Team Games
TH     Theatre
U      Politics, Diplomacy & Government
V      Travel and Transport (incl. transport infrastructure)
WRI
A further embodiment relating to subject analysis
involves a method which is the same as that described
above, except that each word is first lemmatised (rather
than stemmed), and then replaced by all of the subject
domain codes associated with that word. The individual
scores for pairs of text units are then calculated on the
basis of shared codes rather than shared words, using
code-index records, rather than stem-index records.

However, an extra (disambiguation) step is required in
order to avoid (or at least reduce the chances of)
counting codes which are out of context, that is codes
which relate to senses of the word other than the intended
sense. The disambiguation step involves dropping text
unit indexes from the code-index records of tuplets if
they relate to the same word as the first element (i.e.
text unit index) of the tuplet. This requires that the
word associated with each text unit index in each
code-index record be remembered (i.e. recorded) by the
procedure. This procedure can be demonstrated by the
following example.

In the sample text the code BZ (Business & Commerce) is
associated with the words:

executive   occurring once in text units #2# and #3#
business    occurring once in text unit #3#
market      occurring once in text unit #4#
interest    occurring once in text unit #5#

Consequently, a code-index record can be made where the
subject domain code BZ is associated with these text unit
indexes, that is:

<BZ {#2# #3# #3# #4# #5#}>

The full list of code-index records for the sample text
is shown below (instances where a code occurs in a single
text unit are removed as they do not represent lexical
cohesion links).



<BZ {#2# #3# #3# #4# #5#}>
<CN {#2# #3# #3# #4#}>
<DA {#1# #2#}>
<F {#1# #2# #2# #3# #6#}>
<FA {#2# #5#}>
<GOV {#2# #3#}>
<IV {#4# #5#}>
<MGE {#1# #2#}>
<MI {#2# #3# #6#}>
<SCG {#2# #3# #6#}>
<TG {#1# #2#}>
<TH {#2# #3# #6#}>
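
As an illustration, the construction of code-index records can be
sketched in Python as follows. The sketch covers only the four words
of the BZ example and not the whole sample text. The function name
build_code_index and the data layout (a list of text unit and word
pairs, plus a word-to-codes dictionary) are assumptions of this
sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (text unit index, lemmatised word) pairs taken from the BZ example.
occurrences: List[Tuple[int, str]] = [
    (2, "executive"), (3, "executive"), (3, "business"),
    (4, "market"), (5, "interest"),
]

# Subject domain codes of these four words, from the table of codes.
word_codes: Dict[str, List[str]] = {
    "executive": ["BZ", "GOV"],
    "business": ["BZ"],
    "market": ["BZ", "MAR"],
    "interest": ["BZ", "EC", "G", "J", "U"],
}


def build_code_index(
    occurrences: List[Tuple[int, str]], word_codes: Dict[str, List[str]]
) -> Dict[str, List[Tuple[int, str]]]:
    """Map each code to the (text unit, word) occurrences that carry it."""
    records: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for unit, word in occurrences:
        for code in word_codes.get(word, []):
            # The word is kept with the index for later disambiguation.
            records[code].append((unit, word))
    # Drop codes confined to a single text unit: they link nothing.
    return {
        code: sorted(entries)
        for code, entries in records.items()
        if len({unit for unit, _ in entries}) > 1
    }


code_index = build_code_index(occurrences, word_codes)
bz_units = [unit for unit, _ in code_index["BZ"]]
```

For this restricted input only the BZ and GOV records survive, and
bz_units holds the indexes 2, 3, 3, 4, 5 of the BZ record above.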

The first tuplet (disregarding the disambiguation step
mentioned above) is then:

<#1# <DA {#2#}>
     <F {#2# #2# #3# #6#}>
     <MGE {#2#}>
     <TG {#2#}>>

Here the index #1# is excluded from each record because it
is identical with the first index of the tuplet, and so on
for the other tuplets.

To simplify matters, in order to illustrate the
disambiguation step, rather than calculate the individual
scores for each pair of text units, we shall consider only
the contribution to the individual scores which is made
by one of the codes, for example code BZ. The BZ
components of all the tuplets are:

<#2# <BZ {#3# #3# #4# #5#}>>
<#3# <BZ {#2# #4# #5#}>>



WO 99/39282 PCT/JP99/00259 




<#4# <BZ {#2# #3# #3# #5#}>>
<#5# <BZ {#2# #3# #3# #4#}>>

Indexes which are identical with the first index of each
tuplet are excluded, as above.

When allowance is made for the fact that each index is
associated with a particular word, the BZ components of
the tuplets become:

<#2(executive)# <BZ {#3(business)# #4# #5#}>>
<#3(executive)# <BZ {#4# #5#}>>
<#3(business)# <BZ {#2(executive)# #4# #5#}>>
<#4# <BZ {#2# #3# #3# #5#}>>
<#5# <BZ {#2# #3# #3# #4#}>>

The disambiguation step excludes those indexes which relate
to the same word as the first index of each tuplet. The
final tuplets are then:

<#2(executive)# <BZ {#3(business)# #4# #5#}>>
<#3(executive)# <BZ {#4# #5#}>>    (n.b. #2(executive)# excluded)
<#3(business)# <BZ {#2# #4# #5#}>>
<#4# <BZ {#2# #3# #3# #5#}>>
<#5# <BZ {#2# #3# #3# #4#}>>



The contributions made by BZ to the individual scores of
text unit pairs are then as follows:





       #1#   #2#   #3#   #4#   #5#   #6#
#1#     -     0     0     0     0     0
#2#     0     -     1     1     1     0
#3#     0     1     -     2     2     0
#4#     0     1     2     -     1     0
#5#     0     1     2     1     -     0
#6#     0     0     0     0     0     -
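
The contribution of a single code to the individual scores, with the
disambiguation step applied, can be sketched in Python as below. The
function name code_contributions and the disambiguate flag are
assumptions of this sketch. The exception for words associated with
only one subject code is not modelled, beyond allowing the flag to be
switched off.

```python
from collections import Counter
from typing import Counter as CounterType, List, Tuple

# BZ code-index record with the word remembered for each index.
bz_record: List[Tuple[int, str]] = [
    (2, "executive"), (3, "executive"), (3, "business"),
    (4, "market"), (5, "interest"),
]
# GOV record: both indexes come from the same word.
gov_record: List[Tuple[int, str]] = [(2, "executive"), (3, "executive")]


def code_contributions(
    record: List[Tuple[int, str]], disambiguate: bool = True
) -> CounterType[Tuple[int, int]]:
    """Count, per ordered text unit pair, the links made by one code."""
    pairs: CounterType[Tuple[int, int]] = Counter()
    for unit, word in record:
        for other_unit, other_word in record:
            if other_unit == unit:
                # Identical with the first index of the tuplet: excluded.
                continue
            if disambiguate and other_word == word:
                # Same word as the first index: excluded as out of context.
                continue
            pairs[(unit, other_unit)] += 1
    return pairs


bz_pairs = code_contributions(bz_record)
gov_pairs = code_contributions(gov_record)
bz_total = sum(bz_pairs.values())
```

For the BZ record this reproduces the entries of the table above, for
example 1 for the pair #2# and #3# and 2 for the pair #3# and #4#,
and gives no pairs at all for the GOV record.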





When the same procedure is followed for certain other
codes, such as DA, PA, GOV etc., no valid tuplets result.
This is because the text unit indexes within the
code-index records for these codes all relate to the same
word. For example, the code GOV arises from the word
"executive" which occurs in text units #2# and #3#, thus
creating the code-index record <GOV {#2# #3#}> mentioned
above. Because this code-index record does not form a
valid tuplet, the "Government" sense of the word
"executive" makes no contribution to the individual
scores mentioned above. We have already seen that the
"Business" sense of the word "executive" does make such
a contribution, which is the desired result because it
is the "Business" sense of the word which is intended in
the sample text. The method thus achieves a degree of
disambiguation of the subject domain codes, and rejects
codes which are out of context.

Only instances where the words related to the same code
differ in spelling are taken into account. This makes it
possible to achieve higher precision in individuating
salient themes/topics and assessing their relative
importance. Taking the intersection of code sets for
words with different spelling occurring in the same
document tends to exclude contextually inappropriate
interpretations for the words.

However, in cases where a word in the sample text is
associated with only one subject code, the disambiguation
step is not carried out because no disambiguation is
necessary. Hence the code CN, relating to the word "say",
remains.

The following table shows the text unit pairs which each
code connects.

CODES   TEXT UNIT PAIRS
BZ      2-3 2-4 2-5 3-4 3-5 3-2 3-4 3-5 4-2 4-3 4-3 4-5 5-2 5-3 5-3 5-4
F       1-2 1-3 1-6 2-1 2-3 2-6 3-1 3-2 6-1 6-2
FA      2-5 5-2
IV      4-5 5-4
CN      3-4 4-3



Only five codes form valid tuplets, all the other codes
being excluded (as described above).

In total, we have 16 text unit pairs for BZ, 10 for F,
and 2 each for FA, IV and CN. These data can be used to rank
text units in the sample text in terms of topic aptness
by adaptation of the procedure of Figure 1.

The total of all individual scores for each subject domain
code (eg. 16 for BZ, etc) can be converted into percentage
ratios to provide a topic/theme profile of the text as
shown in the table below:

50%      BZ   Business & Commerce
31.25%   F    Finance & Business
6.25%    IV   Investment & Stock Markets
6.25%    FA   Overseas Politics & International Relations
6.25%    CN   Communications



For example, the percentage for BZ is calculated as
16/(16+10+2+2+2) = 50%.
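
The conversion of the pair totals into a topic/theme profile can be
sketched in Python as follows. The function name topic_profile is an
assumption of this sketch, and the input is simply the set of pair
totals stated above.

```python
from typing import Dict


def topic_profile(pair_totals: Dict[str, int]) -> Dict[str, float]:
    """Convert per-code pair totals into percentages of the overall total."""
    total = sum(pair_totals.values())
    if total == 0:
        # No valid tuplets at all: no profile can be formed.
        return {}
    return {code: 100.0 * count / total for code, count in pair_totals.items()}


# Text unit pair totals for the five codes which form valid tuplets.
profile = topic_profile({"BZ": 16, "F": 10, "FA": 2, "IV": 2, "CN": 2})
```

With these totals the sketch gives 50 for BZ, 31.25 for F and 6.25
for each of FA, IV and CN, as in the profile table.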

When used in a summarization system, the level-based
differentiation of text units obtained through the
ranking procedure of Figure 1 (whether based on words or
on codes) can be made to provide an automatic indication
of abridgement size, for example by automatic selection
of all level 1 text units.

Summary size can also be specified by the user, e.g. as
a percentage of the original text size, the selected text
units being chosen from among the ranked text units with
higher levels and higher scores.
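
Both ways of fixing the summary size can be sketched in Python as
below. The function names, the measurement of size in words, the
text unit sizes and the return of the selection in document order
are all assumptions of this sketch. The text above does not say how
size is measured or how the selected text units are to be arranged.

```python
from typing import Dict, List, Optional


def select_level_one(ranked: List[int], levels: Dict[int, Optional[int]]) -> List[int]:
    """Automatic abridgement: keep every level 1 text unit."""
    return sorted(unit for unit in ranked if levels.get(unit) == 1)


def select_by_fraction(
    ranked: List[int], sizes: Dict[int, int], fraction: float
) -> List[int]:
    """Take text units in ranking order until the size budget is used up."""
    budget = fraction * sum(sizes.values())
    chosen: List[int] = []
    used = 0
    for unit in ranked:
        if used + sizes[unit] > budget:
            # Stop at the first text unit which no longer fits.
            break
        chosen.append(unit)
        used += sizes[unit]
    return sorted(chosen)


ranked = [2, 3, 1, 5, 6, 4]                      # ranking of the sample text
levels = {1: 1, 2: 1, 3: 1, 4: None, 5: 0, 6: -1}
sizes = {unit: 10 for unit in ranked}            # hypothetical sizes in words
level_one_summary = select_level_one(ranked, levels)
half_size_summary = select_by_fraction(ranked, sizes, 0.5)
```

Stopping at the first text unit which does not fit keeps the
selection strictly in ranking order. Skipping that unit and trying
lower-ranked ones would be an equally defensible choice.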



The methods described can also be used as indexing devices
in various information systems such as information
retrieval and information extraction systems. For
example, in a database comprising a large number of texts
it is often desirable to provide a short abstract of each
text to assist in both manual and computer searching of
the database. The methods described above can be used to
generate such short abstracts automatically.
The ranking method described above can also be applied
taking into account additional ways of assessing lexical
cohesion, which could be used at step 18 of Figure 1,
such as:

the presence of synonyms across text units as established
by consulting an electronic dictionary of synonyms;

the presence of words sharing the same semantic indicators
across text units as established by consulting an
electronic dictionary, as in the example with subject
domain codes discussed above;

the presence of near-synonymous words across text units
established by estimating the degree of semantic
similarity between word pairs, as in the method
disclosed in British Patent Application No. 9717508.7;

the presence of anaphoric links across text units, i.e.
links between a referential expression such as a
pronoun or a definite description (e.g. "The company"
in text unit #6#) and its antecedent ("Apple" in text
unit #5#).



The same ranking method described in the preferred
embodiment can also be applied by using formatting
commands as indicators of the relevance of particular
types of text fragments. For example, text fragments
enclosed in formatting commands encoding titles and
section headings such as

<h2>Report: Apple Looking for a Partner</h2>

typically contain words which can be effectively used to
provide an indication of the main topic in a text. These
words can be given extra weight in the above method, and
thus be used to assign additional textual relevance to
text units which contain them, e.g. by increasing
further the lexical cohesion score of such text units
during the ranking procedure described above. Formatting
commands can also be selectively preserved so as to
maintain as much of the page layout for the original text
as possible.
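
One way of giving extra weight to heading words is sketched below in
Python. The regular expression, the function names, the bonus of 1
per shared heading word and the omission of stemming are all
assumptions of this sketch. The text above leaves the amount of
extra weight open.

```python
import re
from typing import Set

# Matches the content of heading commands such as <h2>...</h2>.
HEADING = re.compile(r"<h[1-6]>(.*?)</h[1-6]>", re.IGNORECASE | re.DOTALL)


def heading_words(markup: str, stop_words: Set[str]) -> Set[str]:
    """Collect the lower-cased non-stop words found inside headings."""
    words: Set[str] = set()
    for heading_text in HEADING.findall(markup):
        for word in re.findall(r"[A-Za-z]+", heading_text):
            if word.lower() not in stop_words:
                words.add(word.lower())
    return words


def boosted_score(score: int, unit_words: Set[str], headings: Set[str], bonus: int = 1) -> int:
    """Increase a text unit's score by `bonus` per heading word it contains."""
    return score + bonus * len(headings & unit_words)


title_words = heading_words(
    "<h2>Report: Apple Looking for a Partner</h2>", stop_words={"for", "a"}
)
example_score = boosted_score(2, {"apple", "partner", "talk"}, title_words)
```

In this example a text unit containing "apple" and "partner" has its
score raised from 2 to 4.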






The ranking method described above can also be applied
by using lemmatizing instead of stemming as a word
tokenization technique, or dispensing with word
tokenization altogether.

The same ranking method can also be applied to texts
written in a language other than English, by providing

a list of stop words for the language;

a stemmer or lemmatizer for the language; and

any additional means for assessing lexical cohesion
in the language such as semantic similarity and
anaphoric links.

Figure 3 shows schematically a system suitable for
carrying out the methods described above. The system
comprises a programmable data processor 70 with a program
memory 71, for instance in the form of a read only memory
(ROM), storing a program for controlling the data processor
70 to perform, for example, the method illustrated in
Figures 1 and 2. The system further comprises
non-volatile read/write memory 72 for storing, for example,
the list of stop words and the subject domain codes
mentioned above. Working or scratch pad memory for
the data processor is provided by random access memory
(RAM) 73. An input interface 74 is provided, for instance
for receiving commands and data. An output interface 75
is provided, for instance, for displaying information
relating to the progress and result of the procedure.
A text sample may be supplied via the input interface 74
or may optionally be provided in a machine-readable store
76. A thesaurus and/or a dictionary may be supplied in
the read only memory 71 or may be supplied via the input
interface 74. Alternatively, an electronic or
machine-readable thesaurus 77 and an electronic or
machine-readable dictionary 78 may be provided.

The program for operating the system and for performing
the method described hereinabove is stored in the program
memory 71. The program memory may be embodied as
semiconductor memory, for instance of ROM type as
described above. However, the program may be stored in
any other suitable storage medium, such as floppy disc
71a or CD-ROM 71b.

INDUSTRIAL APPLICABILITY 

The use of the structures according to the present 
invention considerably reduces the time taken to operate 
on the text because it is no longer necessary to count 
the number of strings shared between all possible pairs 
of text units in turn. 
More specifically, the degree of connectivity of a text
unit with all other text units in a text can be simply
assessed by quantifying the elements (e.g. words) which
each text unit shares with pairs built by associating each
element in the text with the list of pointers to the text
units in which the element occurs. This provides a
significant advantage in terms of processing speed when
compared to a method such as the one described by Hoey
(1991) and Collier (1994) where the same assessment is
carried out by computing all pairwise combinations of text
units. In particular, the word-per-second processing
rate is significantly less affected by text size.
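
The index-based scoring can be illustrated by the following Python
sketch, which is a simplified reading and not the procedure of the
figures. Each element is associated with a list of pointers to the
text units in which it occurs, and the individual scores are read
off those lists instead of comparing every pair of text units
directly. The function name individual_scores and the three toy text
units are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Counter as CounterType, Dict, List, Tuple


def individual_scores(units: Dict[int, List[str]]) -> CounterType[Tuple[int, int]]:
    """Score ordered text unit pairs from element-to-pointer records."""
    # Build one record per element: the element and its list of pointers.
    index: Dict[str, List[int]] = defaultdict(list)
    for unit, elements in units.items():
        for element in elements:
            index[element].append(unit)

    scores: CounterType[Tuple[int, int]] = Counter()
    for pointers in index.values():
        # Each text unit in the record is paired with every occurrence
        # of each other text unit in the same record.
        for unit in set(pointers):
            for other in pointers:
                if other != unit:
                    scores[(unit, other)] += 1
    return scores


# Hypothetical stemmed text units, with stop words already removed.
toy_units = {1: ["appl", "partner"], 2: ["appl", "appl", "talk"], 3: ["talk"]}
toy_scores = individual_scores(toy_units)
```

Only text units which actually occur together in some record are ever
paired, so text units with nothing in common cost nothing to score.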
CLAIMS

1. A method of operating on a text comprising a plurality
of text units, each comprising one or more strings, the
method being characterised by:

forming a structure for each of at least some of said
strings, in which structure a string is associated with
each pair of text units in which the string occurs;

for each pair of text units summing the number of
occurrences of each other text unit in the same structure
or structures so as to form an individual score for each
pair of text units; and

processing said individual scores for each pair of text
units in order to form a final score for each pair of text
units to determine how many times any string is shared
between each pair of text units and other text units.

2. A method of operating on a text as claimed in claim 1,
which includes the further step of ranking the text units
on the basis of said individual scores.

3. A method of operating on a text as claimed in
claim 1, wherein said text units are sentences, said
strings are words forming said sentences, and the method
comprises the additional steps of removing stop-words,
stemming each remaining word and indexing the sentences
prior to carrying out said summing step, and wherein said
structures are stem-index records each comprising a
stemmed word and one or more indexes corresponding to
sentences in which said stemmed word occurs.

4. A method of operating on a text as claimed in claim 1,
wherein said text is associated with a word text comprising
words, each word being associated with one or more subject
codes representing subjects with which said word is
associated, and wherein said strings are subject codes
associated with said words.

5. A method of operating on a text as claimed in claim 4,
which comprises the further step of keeping a record of
the word spelling associated with each occurrence of a
subject code in a text unit, and wherein during said
summing step occurrences of the same subject code in a
pair of text units are disregarded if the same word
spelling is associated with said same subject code in said
pair of text units.

6. A method of operating on a text as claimed in
claim 5, wherein said step of disregarding occurrences
of subject codes is not carried out for subject
codes which relate to only a single word spelling in the
word text.

7. A method of operating on a text as claimed in
claim 1, wherein said processing step includes
calculating a level for each text unit, in addition
to said final score, and wherein said level indicates
the value of the highest of said individual scores in
relation to a threshold value.

8. A storage medium containing a program for controlling
a programmable data processor (70) to perform a method
as claimed in claim 1.

9. A system for ranking text units in a text, the system
comprising a data processor (70) programmed to perform
the steps of the method of claim 1.